ProbMap - A probabilistic approach for mapping large document collections

نویسنده

  • Thomas Hofmann
چکیده

The visualization of large text databases and document collections is an important step towards more exible and interactive types of information access and retrieval. This paper presents a probabilistic approach which combines a statistical, model{ based analysis of a given set of document with a topological visualization principle. Our method can be utilized to derive topic maps, which represent topical information by characteristic keyword distributions arranged in a two{dimensional spatial layout. Combined with multi-resolution techniques this provides a three-dimensional space for interactive information navigation in large text collections.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Automatic keyword extraction using Latent Dirichlet Allocation topic modeling: Similarity with golden standard and users' evaluation

Purpose: This study investigates the automatic keyword extraction from the table of contents of Persian e-books in the field of science using LDA topic modeling, evaluating their similarity with golden standard, and users' viewpoints of the model keywords. Methodology: This is a mixed text-mining research in which LDA topic modeling is used to extract keywords from the table of contents of sci...

متن کامل

Robust Template Identification of Scanned Documents

Identification of low-quality scanned documents is not trivial in real-world settings. Existing research mainly focusing on similarity-based approaches rely on perfect string data from a document. Also, studies using image processing techniques for document identification rely on clean data and large differences among templates. Both these approaches fail to maintain accuracy in the context of ...

متن کامل

Improving Topic Models with Latent Feature Word Representations

Probabilistic topic models are widely used to discover latent topics in document collections, while latent feature vector representations of words have been used to obtain high performance in many NLP tasks. In this paper, we extend two different Dirichlet multinomial topic models by incorporating latent feature vector representations of words trained on very large corpora to improve the word-t...

متن کامل

A Probabilistic Hierarchical Clustering Method for Organising Collections of Text Documents

In this paper a generic probabilistic framework for the unsupervised hierarchical clustering of large-scale sparse high-dimensional data collections is proposed. The framework is based on a hierarchical probabilistic mixture methodology. Two classes of models emerge from the analysis and these have been termed as symmetric and asymmetric models. For text data specifically both asymmetric and sy...

متن کامل

Hierarchical Bayesian Models for Applications in Information Retrieval

We present a simple hierarchical Bayesian approach to the modeling collections of texts and other large-scale data collections. For text collections, we posit that a document is generated by choosing a random set of multinomial probabilities for a set of possible “topics,” and then repeatedly generating words by sampling from the topic mixture. This model is intractable for exact probabilistic ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • Intell. Data Anal.

دوره 4  شماره 

صفحات  -

تاریخ انتشار 2000